I believe I heard about this course from the mailing list of the doctoral school. I expect to brush up my knowledge of some statistical methods, including PCA and factor analysis, on this course. I have used R before. This file exists in my GitHub repository.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Nov 26 16:36:39 2020"
The text continues here.
I decided to start with the exercises and read the book if needed. I also decided to use the Tidyverse (which already includes ggplot2 and tidyr), as I use it whenever possible when working with R. The e1071 library is needed for calculating kurtosis.
The data set used seems to contain results from a survey done for an introductory university course on social statistics. The variables are three averages, each calculated over answers to a subset of the questions; an “attitude towards statistics” value determined from answers to some questions; the gender and age of the student; and a total score calculated from the questions. According to comments in the original data set, the three derived variables indicate “surface approach” (surf), “strategic approach” (stra) and “deep approach” (deep).
Below, the data is loaded and gender is transformed to an integer in order to make calculating a correlation matrix possible.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr)
library(e1071)
df2 <- read_tsv("data/learning2014.tsv", col_types = cols(
.default = col_double(),
gender = col_character(),
Age = col_integer(),
Attitude = col_integer(),
Points = col_integer()
))
df2 <- df2 %>% mutate(gender_ = recode(gender, M = -1, F = 1))
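The mapping done by recode() above can be checked on a toy vector (a quick sketch; only the toy input is made up):

```r
library(dplyr)

# M maps to -1 and F to 1, mirroring the gender_ column above.
recode(c("M", "F", "F"), M = -1, F = 1)  # -1  1  1
```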
Below are the mean, median, minimum and maximum values for each of the variables, as well as the counts of students by gender:
df3 <- df2 %>% select(-c(gender, gender_))
df3 %>% summarise_all(tibble::lst(mean))
## # A tibble: 1 x 6
## Age_mean Attitude_mean deep_mean stra_mean surf_mean Points_mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 25.5 31.4 3.68 3.12 2.79 22.7
df3 %>% summarise_all(tibble::lst(median))
## # A tibble: 1 x 6
## Age_median Attitude_median deep_median stra_median surf_median Points_median
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22 32 3.67 3.19 2.83 23
df3 %>% summarise_all(tibble::lst(min))
## # A tibble: 1 x 6
## Age_min Attitude_min deep_min stra_min surf_min Points_min
## <int> <int> <dbl> <dbl> <dbl> <int>
## 1 17 14 1.58 1.25 1.58 7
df3 %>% summarise_all(tibble::lst(max))
## # A tibble: 1 x 6
## Age_max Attitude_max deep_max stra_max surf_max Points_max
## <int> <int> <dbl> <dbl> <dbl> <int>
## 1 55 50 4.92 5 4.33 33
df2 %>% select(gender) %>% group_by(gender) %>% tally
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 F 110
## 2 M 56
Below are histograms of the values.
scores_long <- df2 %>% select(c(deep, stra, surf)) %>% gather()
ggplot(scores_long, aes(x = value, fill = key)) +
geom_histogram(position = "dodge") +
xlim(1, 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing missing values (geom_bar).
ggplot(df2, aes(x = Points)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df2, aes(x = Age)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data look reasonably Gaussian, except for age, which clearly is not. Below, the kurtoses of the variables are calculated to give an estimate of the Gaussianity.
df2 %>% select(-c(gender, gender_)) %>% summarise_all(tibble::lst(kurtosis)) %>% t
## [,1]
## Age_kurtosis 3.2396418
## Attitude_kurtosis -0.4788101
## deep_kurtosis 0.6615846
## stra_kurtosis -0.4527415
## surf_kurtosis -0.2714614
## Points_kurtosis -0.2625437
The excess kurtosis values for surf and Points are close to zero, and the values for Attitude, deep and stra are reasonably close, so these variables seem roughly Gaussian. Age, with a kurtosis above three, does not.
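As a sanity check on the interpretation: e1071::kurtosis reports excess kurtosis, which is near zero for a Gaussian sample and clearly positive for a heavy-tailed one. A self-contained sketch with synthetic data:

```r
library(e1071)

set.seed(1)
kurtosis(rnorm(1e5))  # near 0 for a Gaussian sample
kurtosis(rexp(1e5))   # clearly positive for a heavy-tailed sample (true value 6)
```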
Below is a correlation matrix calculated from the variables. The matrix is calculated using R’s built-in implementation for Spearman’s rank correlation coefficient, which I decided to use because some of the variables are discrete.
cor(df2 %>% select(gender_, Age, Attitude, Points, surf, stra, deep), method = "spearman")
## gender_ Age Attitude Points surf stra
## gender_ 1.00000000 -0.19934006 -0.30088840 -0.09763168 0.1194377 0.14753714
## Age -0.19934006 1.00000000 -0.00309789 0.06085512 -0.1530342 0.05065326
## Attitude -0.30088840 -0.00309789 1.00000000 0.44074335 -0.1673798 0.07755921
## Points -0.09763168 0.06085512 0.44074335 1.00000000 -0.1409066 0.16950108
## surf 0.11943771 -0.15303419 -0.16737980 -0.14090656 1.0000000 -0.16145209
## stra 0.14753714 0.05065326 0.07755921 0.16950108 -0.1614521 1.00000000
## deep -0.06842895 0.02059799 0.09167961 -0.01743412 -0.3209585 0.10575868
## deep
## gender_ -0.06842895
## Age 0.02059799
## Attitude 0.09167961
## Points -0.01743412
## surf -0.32095846
## stra 0.10575868
## deep 1.00000000
From the correlation matrix it can be seen that there is a moderate positive correlation between attitude and total points. The value, as well as a plot of the data points with a regression line, suggests a real association:
ggplot(df2, aes(x = Attitude, y = Points)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Total points against attitude score")
## `geom_smooth()` using formula 'y ~ x'
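The strength of this association can also be checked formally with cor.test. A self-contained sketch on synthetic data of comparable strength and sample size (the generated x and y are made up; only the rho ≈ 0.44 and n = 166 come from the data above):

```r
# Significance of a correlation of comparable strength:
set.seed(2)
x <- rnorm(166)
y <- 0.44 * x + rnorm(166, sd = sqrt(1 - 0.44^2))
cor.test(x, y, method = "spearman")$p.value  # well below 0.05
```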
There is also a less-than-moderate negative correlation between gender and attitude, which indicates that males may have a more positive attitude to statistics.
df2 %>% group_by(gender) %>% summarise(`Median attitude` = median(Attitude))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## gender `Median attitude`
## <chr> <dbl>
## 1 F 29.5
## 2 M 34
Additionally, there is a less-than-moderate negative correlation between surf and deep.
ggplot(df2, aes(x = surf, y = deep)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("surf versus deep")
## `geom_smooth()` using formula 'y ~ x'
At this point, I decided to check the Datacamp exercises. Here is a plot of the students’ attitude versus exam points, coloured by gender, as done in the DataCamp exercise:
ggplot(df2, aes(x = Attitude, y = Points, colour = gender)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Total points against attitude score")
## `geom_smooth()` using formula 'y ~ x'
I could not find a simple way to do this with ggplot, so here is a solution similar to the DataCamp exercises. I chose attitude, surf and stra as the explanatory variables, as they had the highest absolute correlations with points.
model <- lm(Points ~ Attitude + surf + stra, data = df2)
summary(model)
##
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## surf -0.58607 0.80138 -0.731 0.46563
## stra 0.85313 0.54159 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The Estimate column indicates that, of the three variables, stra has the largest coefficient, although the raw coefficients are not directly comparable because the variables are measured on different scales. Attitude is the variable most likely to have a real effect on the points, as the last column shows: the probability of obtaining a t value this large or larger, under the null hypothesis of a zero coefficient, is low enough only for attitude. It therefore seems that of the three variables only attitude can be used to predict the number of points. In effect, when attitude increases by one unit, the predicted points increase by 0.34 units.
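To illustrate why raw coefficients are not comparable across differently-scaled predictors, a small sketch on synthetic data (all names and values here are made up):

```r
# Rescaling a predictor rescales its coefficient by the same factor
# without changing the fitted model at all.
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
coef(lm(y ~ x))[2]           # roughly 2
coef(lm(y ~ I(x / 10)))[2]   # exactly 10 times larger, same fit
```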
In general, a linear regression model assumes that the target variable is a linear combination of the model parameters. Further, it is assumed that the errors are normally distributed and have constant variance.
The generated model assumes that total points are best explained by attitude, surf and stra. In other words, deep would not affect the total score. (The summary of the model indicates, though, that attitude is significant while surf and stra are not.)
plot(model, c(1, 2, 5))
As the residuals shown in the Q-Q plot mostly follow the line, the errors seem Gaussian. The residuals vs. fitted plot shows no discernible pattern, indicating that the errors have constant variance. The residuals vs. leverage plot indicates that no single observation has an exceptionally high impact on the model, as the leverage of each observation is relatively small.
I am again using the whole Tidyverse (not just ggplot2 or dplyr), as well as tidyr.
library(tidyverse)
library(tidyr)
library(questionr)
library(boot)
df0 <- read_tsv("data/student_alcohol_use.tsv", col_types = cols(
.default = col_character(),
age = col_integer(),
Medu = col_integer(),
Fedu = col_integer(),
traveltime = col_integer(),
studytime = col_integer(),
failures = col_integer(),
famrel = col_integer(),
freetime = col_integer(),
goout = col_integer(),
Dalc = col_integer(),
Walc = col_integer(),
health = col_integer(),
absences = col_integer(),
G1 = col_integer(),
G2 = col_integer(),
G3 = col_integer(),
alc_use = col_double(),
high_use = col_logical()
))
I did the DataCamp exercises but forgot what the result was. My guess, or recollection, is that high alcohol consumption is associated with sex, the amount of free time, extra-curricular activities and a parent staying at home — the variables used in the model further below.
Below is a summary of the data.
summary(df0)
## school sex age address
## Length:382 Length:382 Min. :15.00 Length:382
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:382 Length:382 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :3.000
## Mean :2.806 Mean :2.565
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason nursery
## Length:382 Length:382 Length:382 Length:382
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## internet guardian traveltime studytime
## Length:382 Length:382 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Mode :character Median :1.000 Median :2.000
## Mean :1.448 Mean :2.037
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000
## failures schoolsup famsup paid
## Min. :0.0000 Length:382 Length:382 Length:382
## 1st Qu.:0.0000 Class :character Class :character Class :character
## Median :0.0000 Mode :character Mode :character Mode :character
## Mean :0.2016
## 3rd Qu.:0.0000
## Max. :3.0000
## activities higher romantic famrel
## Length:382 Length:382 Length:382 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.937
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.000 Median :4.000
## Mean :3.22 Mean :3.113 Mean :1.482 Mean :2.296 Mean :3.573
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## absences G1 G2 G3 alc_use
## Min. : 0.0 Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.007
## 1st Qu.: 1.0 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.007
## Median : 3.0 Median :12.00 Median :12.00 Median :12.00 Median :1.270
## Mean : 4.5 Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.482
## 3rd Qu.: 6.0 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:1.584
## Max. :45.0 Max. :18.00 Max. :18.00 Max. :18.00 Max. :3.296
## high_use
## Mode :logical
## FALSE:307
## TRUE :75
##
##
##
Below are histograms of the variables.
ggplot(df0, aes(x = sex, fill = high_use)) + geom_bar() + ggtitle("Sex vs. high alcohol use")
ggplot(df0, aes(x = freetime, fill = high_use)) + geom_bar() + ggtitle("Histogram for freetime with high alcohol use")
ggplot(df0, aes(x = activities, fill = high_use)) + geom_bar() + ggtitle("Extra-curricular activities vs. high alcohol use")
Please see below for a plot for the parents’ jobs vs. high alcohol use.
I decided to transform the variables to numeric and calculate a correlation matrix. Here, binary variables are transformed to {-1, 1}. Other categorical measurements are transformed to (at least) ordinal if possible.
The variables Mjob, Fjob, reason and guardian are categorical and difficult to transform, so I plotted them instead.
df1 <- df0 %>% mutate(high_use_ = if_else(high_use, 1, -1))
df2 <- df1 %>% mutate(
school_ = recode(school, GP = -1, MS = 1),
sex_ = recode(sex, M = -1, F = 1),
address_ = recode(address, U = -1, R = 1),
famsize_ = recode(famsize, LE3 = -1, GT3 = 1),
Pstatus_ = recode(Pstatus, T = -1, A = 1),
m_at_home = recode(Mjob, at_home = 1, .default = -1),
f_at_home = recode(Fjob, at_home = 1, .default = -1),
schoolsup_ = recode(schoolsup, yes = 1, no = -1),
famsup_ = recode(famsup, yes = 1, no = -1),
paid_ = recode(paid, yes = 1, no = -1),
activities_ = recode(activities, yes = 1, no = -1),
nursery_ = recode(nursery, yes = 1, no = -1),
higher_ = recode(higher, yes = 1, no = -1),
internet_ = recode(internet, yes = 1, no = -1),
romantic_ = recode(romantic, yes = 1, no = -1)
) %>%
select(-c(
school,
sex,
address,
famsize,
Pstatus,
Mjob,
Fjob,
reason,
guardian,
schoolsup,
famsup,
paid,
activities,
nursery,
higher,
internet,
romantic
))
cor(
x = df2 %>% select(-c(high_use, high_use_)),
y = df2 %>% select(high_use_),
method = "spearman"
)
## high_use_
## age 0.05803437
## Medu 0.01693618
## Fedu 0.01417487
## traveltime 0.05407731
## studytime -0.22656222
## failures 0.11573316
## famrel -0.09439361
## freetime 0.13833505
## goout 0.35356847
## Dalc 0.54651084
## Walc 0.71476866
## health 0.11636775
## absences 0.17969439
## G1 -0.16759064
## G2 -0.14964691
## G3 -0.13927400
## alc_use 0.71476866
## school_ -0.03989242
## sex_ -0.27531819
## address_ 0.04992972
## famsize_ -0.06782646
## Pstatus_ 0.05591147
## m_at_home 0.01132896
## f_at_home -0.10334296
## schoolsup_ -0.05838166
## famsup_ -0.09148788
## paid_ 0.01650270
## activities_ 0.02028154
## nursery_ -0.11566069
## higher_ -0.10779565
## internet_ 0.02547942
## romantic_ -0.02488340
df3 <- df1 %>% select(high_use, Mjob, Fjob) %>% gather(key, value, -c(high_use))
ggplot(df3, aes(x = value, y = key, colour = high_use)) + geom_jitter(alpha = 0.5)
The figure does not seem to be very telling. Here, frequencies for high alcohol use are counted for Mjob and Fjob.
df1 %>% select(high_use, Mjob) %>% count(high_use, Mjob) %>% group_by(Mjob) %>% mutate(prop = prop.table(n))
## # A tibble: 10 x 4
## # Groups: Mjob [5]
## high_use Mjob n prop
## <lgl> <chr> <int> <dbl>
## 1 FALSE at_home 42 0.792
## 2 FALSE health 26 0.788
## 3 FALSE other 114 0.826
## 4 FALSE services 77 0.802
## 5 FALSE teacher 48 0.774
## 6 TRUE at_home 11 0.208
## 7 TRUE health 7 0.212
## 8 TRUE other 24 0.174
## 9 TRUE services 19 0.198
## 10 TRUE teacher 14 0.226
df1 %>% select(high_use, Fjob) %>% count(high_use, Fjob) %>% group_by(Fjob) %>% mutate(prop = prop.table(n))
## # A tibble: 9 x 4
## # Groups: Fjob [5]
## high_use Fjob n prop
## <lgl> <chr> <int> <dbl>
## 1 FALSE at_home 16 1
## 2 FALSE health 13 0.765
## 3 FALSE other 168 0.796
## 4 FALSE services 82 0.766
## 5 FALSE teacher 28 0.903
## 6 TRUE health 4 0.235
## 7 TRUE other 43 0.204
## 8 TRUE services 25 0.234
## 9 TRUE teacher 3 0.0968
Below, frequencies are calculated for reason and guardian, too.
df1 %>% select(high_use, reason) %>% count(high_use, reason) %>% group_by(reason) %>% mutate(prop = prop.table(n))
## # A tibble: 8 x 4
## # Groups: reason [4]
## high_use reason n prop
## <lgl> <chr> <int> <dbl>
## 1 FALSE course 111 0.793
## 2 FALSE home 87 0.791
## 3 FALSE other 24 0.706
## 4 FALSE reputation 85 0.867
## 5 TRUE course 29 0.207
## 6 TRUE home 23 0.209
## 7 TRUE other 10 0.294
## 8 TRUE reputation 13 0.133
df1 %>% select(high_use, guardian) %>% count(high_use, guardian) %>% group_by(guardian) %>% mutate(prop = prop.table(n))
## # A tibble: 6 x 4
## # Groups: guardian [3]
## high_use guardian n prop
## <lgl> <chr> <int> <dbl>
## 1 FALSE father 73 0.802
## 2 FALSE mother 220 0.8
## 3 FALSE other 14 0.875
## 4 TRUE father 18 0.198
## 5 TRUE mother 55 0.2
## 6 TRUE other 2 0.125
Somehow there seems to be a possible connection between choosing the school for “other” reasons and high alcohol use. Below the correlation is calculated to make comparing easier.
cor(
x = df1 %>% select(high_use_),
y = df1 %>% select(reason) %>% mutate(reason = recode(reason, other = 1, .default = -1)),
method = "spearman"
)
## reason
## high_use_ 0.07694401
The correlation is negligible.
To conclude, there are correlations between high alcohol use and the following: going out with friends (goout, 0.35), sex (−0.28) and study time (−0.23).
Even these are rather weak, though. Of the variables I guessed, only sex had a connection with high alcohol use according to the correlation coefficient.
The way I read the exercise, I am supposed to keep using the same variables that I guessed.
Here is the model:
df4 <- df0 %>% mutate(
m_at_home = recode(Mjob, at_home = TRUE, .default = FALSE),
f_at_home = recode(Fjob, at_home = TRUE, .default = FALSE),
activities = activities == "yes"
) %>%
mutate(
either_at_home = m_at_home | f_at_home
)
m <- glm(high_use ~ sex + freetime + activities + either_at_home, data = df4, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + freetime + activities + either_at_home,
## family = "binomial", data = df4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0803 -0.8047 -0.4566 -0.3717 2.2992
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.03111 0.53269 -5.690 1.27e-08 ***
## sexM 1.44557 0.30430 4.751 2.03e-06 ***
## freetime 0.23085 0.14067 1.641 0.101
## activitiesTRUE -0.06856 0.27198 -0.252 0.801
## either_at_homeTRUE 0.19863 0.38420 0.517 0.605
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.39 on 381 degrees of freedom
## Residual deviance: 345.40 on 377 degrees of freedom
## AIC: 355.4
##
## Number of Fisher Scoring iterations: 5
odds.ratio(m, level = 0.95)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 % p
## (Intercept) 0.048262 0.016312 0.1323 1.269e-08 ***
## sexM 4.244258 2.379180 7.8877 2.029e-06 ***
## freetime 1.259676 0.959057 1.6672 0.1008
## activitiesTRUE 0.933734 0.547472 1.5948 0.8010
## either_at_homeTRUE 1.219728 0.553964 2.5322 0.6052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model does not look convincing at all (only the student being male seems significant). Furthermore, the confidence intervals of the odds ratios of all the other variables contain the value one, which suggests they are independent of high alcohol use. To fix this, I chose sex and study time as the explanatory variables.
m1 <- glm(high_use ~ sex + studytime, data = df4, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ sex + studytime, family = "binomial",
## data = df4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0033 -0.8096 -0.4519 -0.3515 2.3727
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1851 0.4569 -2.594 0.00949 **
## sexM 1.2833 0.3047 4.211 2.54e-05 ***
## studytime -0.5227 0.1900 -2.751 0.00594 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 378.39 on 381 degrees of freedom
## Residual deviance: 340.03 on 379 degrees of freedom
## AIC: 346.03
##
## Number of Fisher Scoring iterations: 5
odds.ratio(m1, level = 0.95)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 % p
## (Intercept) 0.30572 0.12350 0.7440 0.009487 **
## sexM 3.60860 2.01699 6.6993 2.538e-05 ***
## studytime 0.59294 0.40259 0.8493 0.005936 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary indicates that being male increases the log-odds of high alcohol use by approximately 1.28, while each unit increase in study time decreases the log-odds by approximately 0.52. The p values in the last column indicate that the former is significant and the latter somewhat significant. Additionally, the confidence intervals of the odds ratios of both variables exclude the value one, so both variables may be connected to high alcohol use.
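The odds ratios reported by odds.ratio() are simply the exponentiated model coefficients; this can be verified directly from the estimates above:

```r
# exp() maps log-odds coefficients to odds ratios:
exp(1.2833)   # ~3.61, the sexM odds ratio above
exp(-0.5227)  # ~0.59, the studytime odds ratio above
```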
Below the predictive power of the model is tested.
probabilities <- predict(m1, type = "response")
df4 <- df4 %>%
mutate(probability = probabilities) %>%
mutate(prediction = (0.5 < probability))
table(high_use = df4$high_use, prediction = df4$prediction)
## prediction
## high_use FALSE
## FALSE 307
## TRUE 75
The model never predicted high use — no fitted probability exceeded 0.5 — so its performance was equivalent to guessing that none of the students use much alcohol. The training error is 75 / 382 ≈ 0.196.
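The training error can be read straight off the cross tabulation above; a self-contained sketch (the matrix is typed in from the table, with the all-zero prediction = TRUE column made explicit):

```r
# Off-diagonal counts of the confusion matrix divided by the total:
conf <- matrix(c(307, 75, 0, 0), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
sum(conf[row(conf) != col(conf)]) / sum(conf)  # 75 / 382, about 0.196
```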
Below 10-fold cross-validation is tested.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
cv <- cv.glm(data = df4, cost = loss_func, glmfit = m1, K = 10)
cv$delta
## [1] 0.1963351 0.1963351
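The loss function above is simply the misclassification rate at a 0.5 threshold; a toy check with made-up classes and probabilities:

```r
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
# Two of the four predictions land on the wrong side of 0.5:
loss_func(class = c(0, 0, 1, 1), prob = c(0.2, 0.9, 0.8, 0.1))  # 0.5
```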
On the one hand, the prediction error is in fact lower than in DataCamp’s model. On the other hand, the usefulness of such a model, or of the metric, seems questionable, because simply guessing that no one is a high user yields practically the same error.
Below, the Boston data is loaded. The data set contains various statistics from some 30 years ago, including per capita crime rate by town, nitrogen oxides concentration (parts per 10 million) and pupil–teacher ratio by town. The variables seem continuous, so I used the Pearson correlation coefficient.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(e1071)
library(tidyverse) # Load last in order to use dplyr::select by default.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data("Boston")
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
corrplot.mixed(cor(Boston), tl.pos = "d", tl.col = "black", tl.cex = 0.6, number.cex = 0.6)
Some correlations may be seen, including a strong positive correlation between accessibility to radial highways (rad) and the property-tax rate (tax), and negative correlations between nitrogen oxides concentration (nox) and distance to employment centres (dis), and between lower-status population share (lstat) and median home value (medv).
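The strongest pairs can also be surfaced programmatically rather than read off the plot; a sketch assuming the Boston data from MASS as loaded above (the 0.7 cutoff is an arbitrary choice):

```r
library(MASS)

cm <- cor(Boston)
cm[!lower.tri(cm)] <- NA  # keep each pair once, drop the diagonal
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, "row"]],
           var2 = colnames(cm)[idx[, "col"]],
           r = round(cm[idx], 2))
```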
Below the kurtoses of the variables are calculated to determine whether they are Gaussian.
Boston %>% summarise_all(tibble::lst(kurtosis)) %>% t
## [,1]
## crim_kurtosis 36.59581589
## zn_kurtosis 3.95238731
## indus_kurtosis -1.24019490
## chas_kurtosis 9.48197035
## nox_kurtosis -0.08741064
## rm_kurtosis 1.84183241
## age_kurtosis -0.97802966
## dis_kurtosis 0.45759158
## rad_kurtosis -0.87892910
## tax_kurtosis -1.15031761
## ptratio_kurtosis -0.30480100
## black_kurtosis 7.10371496
## lstat_kurtosis 0.46281705
## medv_kurtosis 1.45098366
From the values, it may be seen that some of the variables are too peaked to be Gaussian, including per capita crime rate by town (crim) and the transformed proportion of Black residents (black). Some, on the other hand, are too flat, including the proportion of non-retail business acres per town (indus) and the full-value property-tax rate per $10,000 (tax).
Below, the data is standardised as in the DataCamp exercise: after scaling, each variable has zero mean and unit variance. The crime rate is then converted to a categorical variable.
boston_ <- Boston %>% scale %>% data.frame %>% tibble
crime_bins <- quantile(boston_$crim)
crime <- cut(boston_$crim, breaks = crime_bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
boston_$crim = crime
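The quantile binning used above can be illustrated on a toy vector (the vector 1:8 is made up): quantile() gives the break points and cut() assigns the bins.

```r
x <- 1:8
# With quartile breaks, 1-2 -> low, 3-4 -> med_low, 5-6 -> med_high, 7-8 -> high:
cut(x, breaks = quantile(x), include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
```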
Below, training and test data sets are created by assigning 80% of the observations at random to the former.
n <- nrow(boston_)
idxs <- sample(n, size = n * 0.8)
training <- boston_[idxs,]
testing <- boston_[-idxs,]
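Note that the split above is unseeded, so it changes between runs; seeding the RNG first would make it reproducible. A sketch of the same split logic on a toy size (n = 10 is made up):

```r
set.seed(42)
n <- 10
idx <- sample(n, size = n * 0.8)
length(idx)                # 8 rows for training
length(setdiff(1:n, idx))  # 2 rows for testing
```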
Next, linear discriminant analysis is done.
model <- lda(crim ~ ., data = training)
classes <- as.numeric(training$crim)
plot(model, dimen = 2, col = classes, pch = classes)
Below, the model is used to predict the crime rates in the testing data set.
actual_classes <- testing$crim
testing <- testing %>% select(-crim)
prediction <- predict(model, newdata = testing)
table(correct = actual_classes, predicted = prediction$class)
## predicted
## correct low med_low med_high high
## low 10 11 1 0
## med_low 6 14 9 0
## med_high 2 5 19 1
## high 0 0 0 24
By looking at the cross tabulation it seems that the model can predict crime rates quite well.
Below, the Euclidean distance matrix is calculated and clustering is done. R’s implementation of k-means computes the distances itself, so there is little use for the precalculated distances in this case.
boston_ <- Boston %>% scale %>% data.frame %>% tibble
dist_mat <- dist(boston_)
summary(dist_mat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
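dist() defaults to the Euclidean distance; a minimal check on two made-up points:

```r
# The classic 3-4-5 triangle: the distance between (0, 0) and (3, 4) is 5.
dist(rbind(c(0, 0), c(3, 4)))  # 5
```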
km1 <- kmeans(boston_, centers = 4)
do_plot <- function (data, km) {
ggpairs(
data,
mapping = ggplot2::aes(colour = as.factor(km$cluster)),
upper = list(continuous = wrap("cor", size = 2.0)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3)),
diag = list(continuous = wrap("densityDiag", size = 0.3)),
combo = wrap("dot", alpha = 0.4, size = 0.4)
) +
theme(
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
axis.ticks = element_line(colour = "black", size = 0.3),
text = element_text(size = 10)
)
}
do_plot(boston_, km1)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'combo' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## Warning in cor(x, y): the standard deviation is zero
Just by looking at the plots, no four discernible clusters can be made out in any of the panels. (I think dimensionality reduction might be a good idea.) Below, the within-cluster sum of squares (WCSS) is calculated to determine a good number of clusters.
set.seed(43)
max_clusters <- 10
twcss <- sapply(1:max_clusters, function(k) { kmeans(boston_, centers = k)$tot.withinss})
df1 <- cbind(1:max_clusters, twcss) %>% as.data.frame %>% tibble
colnames(df1) <- c("Clusters", "TWCSS")
ggplot(df1, aes(x = Clusters, y = TWCSS)) + geom_line() + geom_point() + scale_x_continuous(breaks = 1:max_clusters)
Based on the WCSS it seems that two clusters would be the best choice.
km2 <- kmeans(boston_, centers = 2)
do_plot(boston_, km2)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'combo' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## Warning in cor(x, y): the standard deviation is zero
The data set was already standardised for clustering. Below, k-means is run with three clusters and LDA is used to generate a model. The original function for drawing the arrows for the LDA is from a Stack Overflow answer.
km3 <- kmeans(boston_, centers = 3)
boston__ <- boston_ %>% mutate(Cluster = km3$cluster)
model2 <- lda(Cluster ~ ., data = boston__)
lda.arrows <- function(x, x0, y0, myscale = 1, tex = 0.75, choices = c(1, 2), ...) {
  ## adds `biplot` arrows to an lda using the discriminant function values
  heads <- coef(x)
  arrows(x0 = x0, y0 = y0,
         x1 = myscale * (x0 + heads[, choices[1]]),
         y1 = myscale * (y0 + heads[, choices[2]]), ...)
  text(x = myscale * (x0 + heads[, choices[1]]),
       y = myscale * (y0 + heads[, choices[2]]),
       labels = row.names(heads), cex = tex)
}
plot(model2, dimen = 2, col = boston__$Cluster, pch = boston__$Cluster)
lda.arrows(model2, x0 = rep(1:5 - 2, 3)[1:14], y0 = rep(0:2, 5)[1:14], myscale = 1, length = 0.1)
Based on the plots, it seems that the following are (some of) the most influential linear separators for the clusters:
The two R sections below contain the code from the exercise (slightly modified), as well as the plots. I assume the same model predictors are supposed to be used with k-means. To determine a suitable k, I calculated the TWCSS again.
model_predictors <- dplyr::select(training, -crim)
max_clusters <- 6
twcss2 <- sapply(1:max_clusters, function(k) { kmeans(model_predictors, centers = k)$tot.withinss})
df2 <- cbind(1:max_clusters, twcss2) %>% as.data.frame %>% as_tibble
colnames(df2) <- c("Clusters", "TWCSS")
ggplot(df2, aes(x = Clusters, y = TWCSS)) + geom_line() + geom_point() + scale_x_continuous(breaks = 1:max_clusters)
Removing the crime rate did not seem to notably affect the TWCSS. Since the crime rate was discretised into four categories, I used four centers for k-means.
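When the clusters are expected to track a known categorical variable, a cross-tabulation shows how well they align. A small sketch of the idea, using iris in place of the Boston data and `Species` in place of the crime category (both hypothetical stand-ins, not the course data):

```r
# Sketch: cross-tabulate k-means clusters against a known categorical variable
# to see how well the unsupervised grouping recovers the categories.
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3)
table(Cluster = km$cluster, Species = iris$Species)
```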
# Check the dimensions
dim(model_predictors)
## [1] 404 13
dim(model$scaling)
## [1] 13 3
# Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% model$scaling
matrix_product <- as.data.frame(matrix_product)
km4 <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', size = 0.3, alpha = 0.3, color = training$crim)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', size = 0.3, alpha = 0.3, color = as.factor(km4$cluster))
In this dimensionality reduction, there is one clearly separate cluster around the point x = 7, y = 0, z = 0. It was better separated in the first plot. K-means assigned some of the points in the larger cluster to the previously mentioned cluster. In both plots the larger cluster seems to consist of three parts whose shapes match between the plots.
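The matrix multiplication above is the LDA projection: multiplying the predictor matrix by `$scaling` maps n observations onto (number of classes - 1) discriminant axes. A self-contained sketch on iris (since the Boston `model` object is not reproduced here):

```r
# Sketch: project data onto the LDA discriminants via the scaling matrix.
library(MASS)
m <- lda(Species ~ ., data = iris)
proj <- as.matrix(iris[, 1:4]) %*% m$scaling  # 150 x 2 matrix (3 classes -> 2 LDs)
dim(proj)
# Note: predict(m)$x additionally centres the data first, so its values
# differ from this raw product by a constant shift per column.
```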
I saved the data in an rds file. Below, the data is loaded and the name of one variable clarified.
library(tidyverse)
library(GGally)
library(e1071)
human <- read_rds("data/human.rds")
ggpairs(
human,
upper = list(continuous = wrap("cor", size = 3.0)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.3)),
diag = list(continuous = wrap("densityDiag", size = 0.3)),
combo = wrap("dot", alpha = 0.4, size = 0.4)
) +
theme(
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
axis.ticks = element_line(colour = "black", size = 0.3),
text = element_text(size = 7)
)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'combo' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
Some of the strong correlations include:
Below, the (excess) kurtoses of the variables are calculated as a rough check of whether they could be Gaussian.
human %>% summarise_all(tibble::lst(kurtosis)) %>% t
## [,1]
## Secondary education ratio F/M_kurtosis 0.5450204
## Labour force participation ratio F/M_kurtosis 0.0498217
## Expected_Years_of_Education_kurtosis -0.3438440
## Life_Expectancy_at_Birth_kurtosis -0.1496657
## Gross_National_Income_GNI_per_Capita_kurtosis 6.8292110
## Maternal_Mortality_Ratio_kurtosis 4.1565597
## Adolescent_Birth_Rate_kurtosis 0.8900140
## Percent_Representation_in_Parliament_F_kurtosis -0.1031777
Especially GNI and the maternal mortality ratio are far too heavy-tailed (leptokurtic) to be Gaussian.
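As a sanity check on the scale of these numbers: `e1071::kurtosis` reports excess kurtosis, so a Gaussian sample should land near 0 while a heavy-tailed sample lands well above it. A small sketch with simulated data:

```r
# Sketch: excess kurtosis of a Gaussian vs. a heavy-tailed sample.
library(e1071)
set.seed(42)
k_norm  <- kurtosis(rnorm(1e5))   # near 0 for Gaussian data
k_lnorm <- kurtosis(rlnorm(1e5))  # strongly positive, like GNI above
c(k_norm, k_lnorm)
```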
Below, the PCA is first done on the non-standardised data.
pca1 <- prcomp(human)
biplot(pca1, choices = 1:2, cex = c(0.5, 0.8), main = "PCA done without scaling the data first")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Below, the standardisation is done first.
pca2 <- prcomp(scale(human))
biplot(pca2, choices = 1:2, cex = c(0.5, 0.8), main = "PCA done after scaling the data")
The variable names are self-explanatory. The partially obfuscated variable names on the left are Expected_Years_of_Education, GNI per capita, Secondary education ratio F/M, and Life_Expectancy_at_Birth.
Since principal components are linear combinations that contain as much of the variance of the input data as possible, not scaling the variables can lead to some variables with naturally broad scales dominating. The second set of results is in that sense more accurate.
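The effect of scale dominance is easy to demonstrate with simulated data: two independent standard-normal variables, one inflated by a factor of 1000, give a PC1 that is essentially just the inflated variable unless the data are scaled first. A minimal sketch:

```r
# Sketch: a variable on a broad scale dominates unscaled PCA.
set.seed(1)
d <- data.frame(a = rnorm(100), b = 1000 * rnorm(100))
unscaled <- prcomp(d)
scaled   <- prcomp(scale(d))
# Share of total variance on PC1: ~100% unscaled, ~50% after scaling.
summary(unscaled)$importance["Proportion of Variance", "PC1"]
summary(scaled)$importance["Proportion of Variance", "PC1"]
```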
According to the plot, the first principal component consists of the following variables:
The first four are positively correlated with each other, as are the latter two. Variables in one of the groups are negatively correlated with variables in another. The second principal component consists of labour force participation ratio F/M and percent representation in parliament F.
Below, the tea data is loaded from the FactoMineR package. A summary of the variables is also shown.
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
I decided to keep the first 18 variables for MCA. I marked age as a supplementary variable since it is not a factor.
mca <- MCA(tea[, 1:18], graph = FALSE)
plot(mca, invisible = c("ind"), habillage = "quali")
Since the plot shows similar variables together, some observations about the factors may be made:
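Beyond the factor map, the MCA result can also be inspected numerically. A short sketch (refitting the same `mca` object as above so the snippet is self-contained):

```r
# Sketch: inspect MCA eigenvalues and variable-category contributions.
library(FactoMineR)
data(tea)
mca <- MCA(tea[, 1:18], graph = FALSE)
head(mca$eig)                  # variance explained per dimension
head(mca$var$contrib[, 1:2])   # category contributions to dimensions 1-2
```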